{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regularization of linear regression model\n",
    "\n",
    "In this notebook, we explore some limitations of linear regression models and\n",
    "demonstrate the benefits of using regularized models instead. Additionally, we\n",
    "discuss the importance of scaling the data when working with regularized\n",
    "models, especially when tuning the regularization parameter.\n",
    "\n",
    "We start by highlighting the problem of overfitting that can occur with a\n",
    "simple linear regression model.\n",
    "\n",
    "## Effect of regularization\n",
    "\n",
    "We load the Ames housing dataset. We retain some specific\n",
    "`features_of_interest`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "ames_housing = pd.read_csv(\"../datasets/ames_housing_no_missing.csv\")\n",
    "features_of_interest = [\n",
    "    \"LotFrontage\",\n",
    "    \"LotArea\",\n",
    "    \"PoolArea\",\n",
    "    \"YearBuilt\",\n",
    "    \"YrSold\",\n",
    "]\n",
    "target_name = \"SalePrice\"\n",
    "data, target = (\n",
    "    ames_housing[features_of_interest],\n",
    "    ames_housing[target_name],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In one of the previous notebooks, we showed that linear models could be used\n",
    "even when there is no linear relationship between the `data` and `target`.\n",
    "For instance, one can use the `PolynomialFeatures` transformer to create\n",
    "additional features that capture some non-linear interactions between them.\n",
    "\n",
    "Here, we use this transformer to augment the feature space. Subsequently, we\n",
    "train a linear regression model. We use cross-validation with\n",
    "`return_train_score=True` to evaluate both the train scores and the\n",
    "generalization capabilities of our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_validate\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "linear_regression = make_pipeline(\n",
    "    PolynomialFeatures(degree=2, include_bias=False), LinearRegression()\n",
    ").set_output(transform=\"pandas\")\n",
    "cv_results = cross_validate(\n",
    "    linear_regression,\n",
    "    data,\n",
    "    target,\n",
    "    cv=10,\n",
    "    scoring=\"neg_mean_squared_error\",\n",
    "    return_train_score=True,\n",
    "    return_estimator=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compare the mean squared error on the training and testing set to\n",
    "assess the generalization performance of our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_error = -cv_results[\"train_score\"]\n",
    "print(\n",
    "    \"Mean squared error of linear regression model on the train set:\\n\"\n",
    "    f\"{train_error.mean():.2e} \u00b1 {train_error.std():.2e}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_error = -cv_results[\"test_score\"]\n",
    "print(\n",
    "    \"Mean squared error of linear regression model on the test set:\\n\"\n",
    "    f\"{test_error.mean():.2e} \u00b1 {test_error.std():.2e}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training error is in average one order of magnitude lower than the testing\n",
    "error (lower error is better). Such a gap between the training and testing\n",
    "scores is an indication that our model overfitted the training set. Indeed,\n",
    "this is one of the dangers when augmenting the number of features with a\n",
    "`PolynomialFeatures` transformer. For instance, one does not expect features\n",
    "such as `PoolArea * YrSold` to be very predictive.\n",
    "\n",
    "To analyze the weights of the model, we can create a dataframe. The columns of\n",
    "the dataframe contain the feature names, while the rows store the coefficients\n",
    "of each model of a given cross-validation fold.\n",
    "\n",
    "In order to obtain the feature names associated with each feature combination,\n",
    "we need to extract them from the augmented data created by\n",
    "`PolynomialFeatures`. Fortunately, scikit-learn provides a convenient method\n",
    "called `feature_names_in_` for this purpose. Let's begin by retrieving the\n",
    "coefficients from the model fitted in the first cross-validation fold."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_first_fold = cv_results[\"estimator\"][0]\n",
    "model_first_fold"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can access the fitted `LinearRegression` (step `-1` i.e. the last step\n",
    "of the `linear_regression` pipeline) to recover the feature names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_names = model_first_fold[-1].feature_names_in_\n",
    "feature_names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code creates a list by iterating through the estimators and\n",
    "querying their last step for the learned `coef_`. We can then create the\n",
    "dataframe containing all the information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "coefs = [est[-1].coef_ for est in cv_results[\"estimator\"]]\n",
    "weights_linear_regression = pd.DataFrame(coefs, columns=feature_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's use a box plot to see the coefficients variations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n",
    "fig, ax = plt.subplots(figsize=(10, 10))\n",
    "weights_linear_regression.plot.box(color=color, vert=False, ax=ax)\n",
    "_ = ax.set(title=\"Linear regression weights (linear scale)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By looking at the bar plot above it would seem that most of the features are\n",
    "very close to zero, but this is just an effect of visualizing them on the same\n",
    "scale as the extremely large span of `\"YrSold\"`. Instead we can use a\n",
    "symmetric log scale for the plot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n",
    "fig, ax = plt.subplots(figsize=(10, 10))\n",
    "weights_linear_regression.plot.box(color=color, vert=False, ax=ax)\n",
    "_ = ax.set(\n",
    "    title=\"Linear regression weights (symmetric log scale)\",\n",
    "    xscale=\"symlog\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Observe that some coefficients are extremely large while others are extremely\n",
    "small, yet non-zero. Furthermore, the coefficient values can be very unstable\n",
    "across cross-validation folds.\n",
    "\n",
    "We can force the linear regression model to consider all features in a more\n",
    "homogeneous manner. In fact, we could force large positive or negative\n",
    "weights to shrink toward zero. This is known as regularization. We use a\n",
    "ridge model which enforces such behavior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Ridge\n",
    "\n",
    "ridge = make_pipeline(\n",
    "    PolynomialFeatures(degree=2, include_bias=False),\n",
    "    Ridge(alpha=100, solver=\"cholesky\"),\n",
    ")\n",
    "cv_results = cross_validate(\n",
    "    ridge,\n",
    "    data,\n",
    "    target,\n",
    "    cv=20,\n",
    "    scoring=\"neg_mean_squared_error\",\n",
    "    return_train_score=True,\n",
    "    return_estimator=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code cell above can generate a couple of warnings (depending on the\n",
    "choice of solver) because the features included both extremely large and\n",
    "extremely small values, which are causing numerical problems when training\n",
    "the predictive model. We will get to that in a bit.\n",
    "\n",
    "Let us evaluate the train and test scores of this model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_error = -cv_results[\"train_score\"]\n",
    "print(\n",
    "    \"Mean squared error of ridge model on the train set:\\n\"\n",
    "    f\"{train_error.mean():.2e} \u00b1 {train_error.std():.2e}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_error = -cv_results[\"test_score\"]\n",
    "print(\n",
    "    \"Mean squared error of ridge model on the test set:\\n\"\n",
    "    f\"{test_error.mean():.2e} \u00b1 {test_error.std():.2e}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the training and testing scores get closer, indicating that our\n",
    "model is less overfitting (yet still overfitting!). We can compare the values\n",
    "of the weights of ridge with the un-regularized linear regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coefs = [est[-1].coef_ for est in cv_results[\"estimator\"]]\n",
    "weights_ridge = pd.DataFrame(coefs, columns=feature_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(8, 10))\n",
    "weights_ridge.plot.box(color=color, vert=False, ax=ax)\n",
    "_ = ax.set(title=\"Ridge regression weights\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the overall magnitudes of the weights are shrunk\n",
    "(yet non-zero!) with respect to the linear regression model. If you want to,\n",
    "feel free to use a symmetric log scale in the previous plot.\n",
    "\n",
    "You can also observe that even if the weights' values are less extreme, they\n",
    "are still unstable from one fold to another. Even worst, the results can vary\n",
    "a lot depending on the choice of the solver (for instance try to set\n",
    "`solver=\"saga\"` or `solver=\"lsqr\"` instead of `solver=\"cholesky\"` and re-run\n",
    "the above cells).\n",
    "\n",
    "In the following we attempt to resolve those remaining problems, by\n",
    "focusing on two important aspects we omitted so far:\n",
    "- the need to **scale the data**, and\n",
    "- the need to **search for the best regularization parameter**.\n",
    "\n",
    "## Feature scaling and regularization\n",
    "\n",
    "On the one hand, weights define the association between feature values and the\n",
    "predicted target, which depends on the scales of both the feature values and\n",
    "the target. On the other hand, regularization adds constraints on the weights\n",
    "of the model through the `alpha` parameter. Therefore, the effect that feature\n",
    "rescaling has on the final weights also interacts with the use of\n",
    "regularization.\n",
    "\n",
    "Let's consider the case where features live on the same scale/units: if two\n",
    "features are found to be equally important by the model, they are affected\n",
    "similarly by the regularization strength.\n",
    "\n",
    "Now, let's consider the scenario where two features have completely different\n",
    "data scales (for instance age in years and annual revenue in dollars). Let's\n",
    "also assume that both features are approximately equally predictive and are\n",
    "not too correlated. Fitting a linear regression without scaling and without\n",
    "regularization would give a higher weight to the feature with the smallest\n",
    "natural scale. If we add regularization, the feature with the smallest natural\n",
    "scale would be penalized more than the other feature. This is not desirable\n",
    "given the hypothesis that both features are equally important. In such case we\n",
    "require the regularization to stay neutral.\n",
    "\n",
    "In practice, we don't know ahead of time which features are predictive, and\n",
    "therefore we want regularization to treat all features approximately equally\n",
    "by default. This can be achieved by rescaling the features.\n",
    "\n",
    "Furthermore, many numerical solvers used internally in scikit-learn behave\n",
    "better when features are approximately on the same scale. Heterogeneously\n",
    "scaled data can be detrimental when solving for the optimal weights (hence the\n",
    "warnings we tend to get when fitting linear models on raw data). Therefore,\n",
    "when working with a linear model and numerical data, it is generally a good\n",
    "practice to scale the data.\n",
    "\n",
    "Thus, we add a `MinMaxScaler` in the machine learning pipeline, which scales\n",
    "each feature individually such that its range maps into the range between zero\n",
    "and one. We place it just before the `PolynomialFeatures` transformer as\n",
    "powers of features in the range between zero and one remain in the same range."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "scaled_ridge = make_pipeline(\n",
    "    MinMaxScaler(),\n",
    "    PolynomialFeatures(degree=2, include_bias=False),\n",
    "    Ridge(alpha=10, solver=\"cholesky\"),\n",
    ")\n",
    "cv_results = cross_validate(\n",
    "    scaled_ridge,\n",
    "    data,\n",
    "    target,\n",
    "    cv=10,\n",
    "    scoring=\"neg_mean_squared_error\",\n",
    "    return_train_score=True,\n",
    "    return_estimator=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_error = -cv_results[\"train_score\"]\n",
    "print(\n",
    "    \"Mean squared error of scaled ridge model on the train set:\\n\"\n",
    "    f\"{train_error.mean():.2e} \u00b1 {train_error.std():.2e}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_error = -cv_results[\"test_score\"]\n",
    "print(\n",
    "    \"Mean squared error of scaled ridge model on the test set:\\n\"\n",
    "    f\"{test_error.mean():.2e} \u00b1 {test_error.std():.2e}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that scaling data has a positive impact on the test error: it is\n",
    "now both lower and closer to the train error. It means that our model is less\n",
    "overfitted and that we are getting closer to the best generalization sweet\n",
    "spot.\n",
    "\n",
    "If you want to try different solvers, you can notice that fitting this\n",
    "pipeline no longer generates any warning regardless of such choice.\n",
    "Additionally, changing the solver should no longer result in significant\n",
    "changes in the weights.\n",
    "\n",
    "Let's have an additional look to the different weights."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coefs = [est[-1].coef_ for est in cv_results[\"estimator\"]]\n",
    "weights_ridge_scaled_data = pd.DataFrame(coefs, columns=feature_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(8, 10))\n",
    "weights_ridge_scaled_data.plot.box(color=color, vert=False, ax=ax)\n",
    "_ = ax.set(title=\"Ridge regression weights with data scaling\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compared to the previous plots, we see that now most weight magnitudes have a\n",
    "similar order of magnitude, i.e. they are more equally contributing. The\n",
    "number of unstable weights also decreased.\n",
    "\n",
    "In the previous model, we set `alpha=10`. We can now check the impact of\n",
    "`alpha` by increasing it to a very large value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ridge_large_alpha = make_pipeline(\n",
    "    MinMaxScaler(),\n",
    "    PolynomialFeatures(degree=2, include_bias=False),\n",
    "    Ridge(alpha=1_000_000, solver=\"lsqr\"),\n",
    ")\n",
    "cv_results = cross_validate(\n",
    "    ridge_large_alpha,\n",
    "    data,\n",
    "    target,\n",
    "    cv=10,\n",
    "    scoring=\"neg_mean_squared_error\",\n",
    "    return_train_score=True,\n",
    "    return_estimator=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coefs = [est[-1].coef_ for est in cv_results[\"estimator\"]]\n",
    "weights_ridge_scaled_data = pd.DataFrame(coefs, columns=feature_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(8, 10))\n",
    "weights_ridge_scaled_data.plot.box(color=color, vert=False, ax=ax)\n",
    "_ = ax.set(title=\"Ridge regression weights with data scaling and large alpha\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When examining the weight values, we notice that as the `alpha` value\n",
    "increases, the weights decrease. A negative value of `alpha` can lead to\n",
    "unpredictable and unstable behavior in the model.\n",
    "\n",
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p>Here, we only focus on numerical features. For categorical features, it is\n",
    "generally common to omit scaling when features are encoded with a\n",
    "<tt class=\"docutils literal\">OneHotEncoder</tt> since the feature values are already on a similar scale.</p>\n",
    "<p class=\"last\">However, this choice may depend on the scaling method and the user case. For\n",
    "instance, standard scaling categorical features that are imbalanced (e.g. more\n",
    "occurrences of a specific category) would even out the impact of\n",
    "regularization to each category. However, scaling such features in the\n",
    "presence of rare categories could be problematic (i.e. division by a very\n",
    "small standard deviation) and it can therefore introduce numerical issues.</p>\n",
    "</div>\n",
    "\n",
    "In the previous analysis, we chose the parameter beforehand and fixed it for\n",
    "the analysis. In the next section, we check how the regularization parameter\n",
    "`alpha` should be tuned.\n",
    "\n",
    "## Tuning the regularization parameter\n",
    "\n",
    "As mentioned, the regularization parameter needs to be tuned on each dataset.\n",
    "The default parameter does not lead to the optimal model. Therefore, we need\n",
    "to tune the `alpha` parameter.\n",
    "\n",
    "Model hyperparameter tuning should be done with care. Indeed, we want to find\n",
    "an optimal parameter that maximizes some metrics. Thus, it requires both a\n",
    "training set and testing set.\n",
    "\n",
    "However, this testing set should be different from the out-of-sample testing\n",
    "set that we used to evaluate our model: if we use the same one, we are using\n",
    "an `alpha` which was optimized for this testing set and it breaks the\n",
    "out-of-sample rule.\n",
    "\n",
    "Therefore, we should include search of the hyperparameter `alpha` within the\n",
    "cross-validation. As we saw in previous notebooks, we could use a grid-search.\n",
    "However, some predictor in scikit-learn are available with an integrated\n",
    "hyperparameter search, more efficient than using a grid-search. The name of\n",
    "these predictors finishes by `CV`. In the case of `Ridge`, scikit-learn\n",
    "provides a `RidgeCV` regressor.\n",
    "\n",
    "Cross-validating a pipeline that contains such predictors allows to make a\n",
    "nested cross-validation: the inner cross-validation searches for the best\n",
    "alpha, while the outer cross-validation gives an estimate of the testing\n",
    "score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.linear_model import RidgeCV\n",
    "\n",
    "alphas = np.logspace(-7, 5, num=100)\n",
    "ridge = make_pipeline(\n",
    "    MinMaxScaler(),\n",
    "    PolynomialFeatures(degree=2, include_bias=False),\n",
    "    RidgeCV(alphas=alphas, store_cv_results=True),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import ShuffleSplit\n",
    "\n",
    "cv = ShuffleSplit(n_splits=50, random_state=0)\n",
    "cv_results = cross_validate(\n",
    "    ridge,\n",
    "    data,\n",
    "    target,\n",
    "    cv=cv,\n",
    "    scoring=\"neg_mean_squared_error\",\n",
    "    return_train_score=True,\n",
    "    return_estimator=True,\n",
    "    n_jobs=2,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_error = -cv_results[\"train_score\"]\n",
    "print(\n",
    "    \"Mean squared error of tuned ridge model on the train set:\\n\"\n",
    "    f\"{train_error.mean():.2e} \u00b1 {train_error.std():.2e}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_error = -cv_results[\"test_score\"]\n",
    "print(\n",
    "    \"Mean squared error of tuned ridge model on the test set:\\n\"\n",
    "    f\"{test_error.mean():.2e} \u00b1 {test_error.std():.2e}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By optimizing `alpha`, we see that the training and testing scores are close.\n",
    "It indicates that our model is not overfitting.\n",
    "\n",
    "When fitting the ridge regressor, we also requested to store the error found\n",
    "during cross-validation (by setting the parameter `store_cv_results=True`). We\n",
    "can plot the mean squared error for the different `alphas` regularization\n",
    "strengths that we tried. The error bars represent one standard deviation of the\n",
    "average mean square error across folds for a given value of `alpha`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mse_alphas = [\n",
    "    est[-1].cv_results_.mean(axis=0) for est in cv_results[\"estimator\"]\n",
    "]\n",
    "cv_alphas = pd.DataFrame(mse_alphas, columns=alphas)\n",
    "cv_alphas = cv_alphas.aggregate([\"mean\", \"std\"]).T\n",
    "cv_alphas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(8, 6))\n",
    "ax.errorbar(cv_alphas.index, cv_alphas[\"mean\"], yerr=cv_alphas[\"std\"])\n",
    "_ = ax.set(\n",
    "    xscale=\"log\",\n",
    "    xlabel=\"alpha\",\n",
    "    yscale=\"log\",\n",
    "    ylabel=\"Mean squared error\\n (lower is better)\",\n",
    "    title=\"Testing error in RidgeCV's inner cross-validation\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, regularization is just like salt in cooking: one must balance\n",
    "its amount to get the best generalization performance. We can check if the\n",
    "best `alpha` found is stable across the cross-validation fold."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_alphas = [est[-1].alpha_ for est in cv_results[\"estimator\"]]\n",
    "best_alphas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The optimal regularization strength is not necessarily the same on all\n",
    "cross-validation iterations. But since we expect each cross-validation\n",
    "resampling to stem from the same data distribution, it is common practice to\n",
    "choose the best `alpha` to put into production as lying in the range defined\n",
    "by:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    f\"Min optimal alpha: {np.min(best_alphas):.2f} and \"\n",
    "    f\"Max optimal alpha: {np.max(best_alphas):.2f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This range can be reduced depending on the feature engineering and\n",
    "preprocessing.\n",
    "\n",
    "Here is a summary of important points highlighted in this notebook:\n",
    "- scaling features makes the effect of regularization more even: all variables\n",
    "  are regularized by comparable magnitude, which would not necessarily be the\n",
    "  case with the natural feature scales;\n",
    "- scaling features makes the numerical solvers more stable which is also\n",
    "  helpful to tune the regularization parameter more independently of the\n",
    "  choice of the solver used to fit the linear model;\n",
    "- tuning the regularization parameter of the `Ridge` estimator can be done\n",
    "  very efficiently by using the `RidgeCV` class. Wrapping it into a\n",
    "  `cross_validate` call makes it possible to assess the true generalization\n",
    "  power of the whole pipeline by including the tuning of the regularization\n",
    "  parameter as part of the learning process: this is an example of \"nested\n",
    "  cross-validation\";\n",
    "- doing so makes it possible to check that the optimal value of the\n",
    "  regularization strength `alpha` is robust to a resampling of the dataset. If\n",
    "  it wasn't the case it would hint at a problem with the dataset (e.g.\n",
    "  presence of outliers in the features or the target that influence the\n",
    "  learning process disproportionately) or a bad choice of other elements of\n",
    "  the feature engineering pipeline."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}